Linear Regression
24 January, 2024
You already…
Please install and load the following packages
Access the lecture slides from the course landing page
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics
I am an RStudio (Posit) certified tidyverse instructor.
I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
The lm() function is used to fit linear models in R
Elmhurst data from the openintro package
Call:
lm(formula = gift_aid ~ family_income, data = elmhurst)

Residuals:
     Min       1Q   Median       3Q      Max
-10.1128  -3.6234  -0.2161   3.1587  11.5707

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   24.31933    1.29145  18.831  < 2e-16 ***
family_income -0.04307    0.01081  -3.985 0.000229 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.783 on 48 degrees of freedom
Multiple R-squared:  0.2486, Adjusted R-squared:  0.2329
F-statistic: 15.88 on 1 and 48 DF,  p-value: 0.0002289
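To make the coefficient table concrete, here is a minimal sketch that turns the two estimates into predictions by hand. The helper name `predict_aid` is hypothetical, the coefficients are copied from the output above, and the sketch assumes that both `family_income` and `gift_aid` are recorded in thousands of dollars.

```r
# Coefficients copied from the summary output above
intercept <- 24.31933
slope     <- -0.04307

# Hypothetical helper: predicted gift aid for a given family income.
# Assumes both variables are measured in thousands of dollars.
predict_aid <- function(family_income) {
  intercept + slope * family_income
}

predict_aid(100)                      # family income of $100,000
predict_aid(101) - predict_aid(100)   # change per extra $1,000: the slope
```

Under these assumptions, the first call gives about 20.0, that is, roughly $20,000 in predicted aid. The second call simply recovers the slope: each additional $1,000 of family income is associated with about $43 less aid.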
Variance of the outcome variable
Variance of the residuals
If we apply our least squares line, then this model reduces our uncertainty in predicting aid using a student’s family income
\[\frac{s_{outcome}^2 - s_{residual}^2}{s_{outcome}^2} = \frac{29800 - 22800}{29800} \approx 0.235\]
That is, about 24%.
Using family income in a linear model to predict aid reduced the variation in the outcome variable by about 24%. This proportion is called R-squared.
Correlation between the two variables
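The variance-reduction calculation and the correlation are two views of the same number. The sketch below assumes the openintro package is installed and recomputes the proportion three ways for the Elmhurst model: from the two variances, from the model summary, and as the squared correlation. The variable names are illustrative.

```r
# Assumes the openintro package is installed; it provides the elmhurst data.
elmhurst <- openintro::elmhurst

fit <- lm(gift_aid ~ family_income, data = elmhurst)

s2_outcome  <- var(elmhurst$gift_aid)   # variance of the outcome variable
s2_residual <- var(residuals(fit))      # variance of the residuals

# Proportion of the outcome's variance accounted for by the fitted line
r2_from_variances <- (s2_outcome - s2_residual) / s2_outcome

# The same quantity obtained two other ways
r2_from_summary <- summary(fit)$r.squared
r2_from_cor     <- cor(elmhurst$family_income, elmhurst$gift_aid)^2

c(variances = r2_from_variances,
  summary = r2_from_summary,
  correlation = r2_from_cor)
```

All three values should agree with the Multiple R-squared reported in the model summary (0.2486). The equality with the squared correlation holds for a simple regression with a single predictor.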
Auto data from ISLR2
# Requires dplyr (for %>% and mutate)
loans <- openintro::loans_full_schema %>%
  # Credit utilisation: share of the total credit limit currently in use
  mutate(credit_util = total_credit_utilized / total_credit_limit)
loan_model <- lm(interest_rate ~ verified_income + debt_to_income + public_record_bankrupt +
                   term + credit_util + issue_month, data = loans)
summary(loan_model)
Call:
lm(formula = interest_rate ~ verified_income + debt_to_income +
    public_record_bankrupt + term + credit_util + issue_month,
    data = loans)

Residuals:
     Min       1Q   Median       3Q      Max
-13.0116  -3.1376  -0.7338   2.3464  19.4852

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                     2.234302   0.210123  10.633  < 2e-16 ***
verified_incomeSource Verified  1.099804   0.099626  11.039  < 2e-16 ***
verified_incomeVerified         2.667962   0.117801  22.648  < 2e-16 ***
debt_to_income                  0.022763   0.002959   7.692 1.58e-14 ***
public_record_bankrupt          0.489424   0.128773   3.801 0.000145 ***
term                            0.154173   0.003975  38.789  < 2e-16 ***
credit_util                     4.838323   0.163103  29.664  < 2e-16 ***
issue_monthJan-2018             0.048263   0.108881   0.443 0.657586
issue_monthMar-2018            -0.047001   0.107379  -0.438 0.661606
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.334 on 9965 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.2486, Adjusted R-squared:  0.248
F-statistic: 412.2 on 8 and 9965 DF,  p-value: < 2.2e-16
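With several predictors, a prediction is the intercept plus each coefficient times the matching value, and categorical predictors enter as 0/1 indicators. The sketch below works this out by hand for a hypothetical borrower. The borrower's values are made up for illustration. It also assumes that `debt_to_income` is recorded in percent, that `term` is in months, and that the levels missing from the table (for income verification and issue month) are the reference categories.

```r
# Coefficients copied from the summary output above
coefs <- c(
  intercept                = 2.234302,
  verified_income_verified = 2.667962,
  debt_to_income           = 0.022763,
  term                     = 0.154173,
  credit_util              = 4.838323
)

# Hypothetical borrower (illustrative values, not a row from the data):
# verified income, no bankruptcy on record, loan issued in the reference month.
borrower <- c(
  intercept                = 1,
  verified_income_verified = 1,    # indicator: income is "Verified"
  debt_to_income           = 20,   # assumed to be in percent
  term                     = 36,   # assumed to be in months
  credit_util              = 0.5   # half of the credit limit in use
)

# Indicators that equal 0 (bankruptcy, other issue months) drop out of the sum
predicted_rate <- sum(coefs * borrower)
predicted_rate
```

Under these assumptions the predicted interest rate comes out at about 13.3. Switching the borrower to the reference income category would lower it by the `verified_incomeVerified` coefficient, about 2.67 points, holding everything else fixed.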
Credit data from ISLR2
Collinearity - When two predictors are collinear, it is difficult to separate out how each one is individually associated with the response.
\[VIF(\hat\beta_j) = \frac{1}{1-R_{X_j|X_{-j}}^2}\]
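The formula above can be sketched directly in base R: regress each predictor on all the others, take that regression's R-squared, and plug it in. The helper `vif_manual` and the simulated data are hypothetical and only illustrate the calculation. The sketch assumes numeric predictors, so categorical predictors would need separate handling.

```r
# Hypothetical helper: VIF of each predictor, computed from the R-squared
# of regressing that predictor on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux_fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux_fit)$r.squared)
  })
}

# Simulated data (not from the lecture): x2 is nearly a copy of x1,
# while x3 is unrelated to both.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
sim <- data.frame(x1, x2, x3)

vifs <- vif_manual(sim, c("x1", "x2", "x3"))
round(vifs, 1)
```

In this simulation the two collinear predictors get very large VIFs, while the unrelated one stays close to 1, the smallest value a VIF can take. The same helper could be pointed at the numeric columns of a real data set.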